Proximity Data


Multidimensional scaling of two-mode three-way asymmetric dissimilarities: finding archetypal profiles and clustering

Alcacer, Aleix, Benitez, Rafael, Bolos, Vicente J., Epifanio, Irene

arXiv.org Machine Learning

Multidimensional scaling visualizes dissimilarities among objects and reduces data dimensionality. While many methods address symmetric proximity data, asymmetric and especially three-way proximity data (capturing relationships across multiple occasions) remain underexplored. Recent developments, such as the h-plot, enable the analysis of asymmetric and non-reflexive relationships by embedding dissimilarities in a Euclidean space, allowing further techniques like archetypoid analysis to identify representative extreme profiles. However, no existing methods extract archetypal profiles from three-way asymmetric proximity data. This work extends the h-plot methodology to three-way proximity data under both symmetric and asymmetric, conditional and unconditional frameworks. The proposed approach offers several advantages: intuitive interpretability through a unified Euclidean representation; an explicit, eigenvector-based analytical solution free from local minima; scale invariance under linear transformations; computational efficiency for large matrices; and a straightforward goodness-of-fit evaluation. Furthermore, it enables the identification of archetypal profiles and clustering structures for three-way asymmetric proximities. Its performance is compared with existing models for multidimensional scaling and clustering, and illustrated through a financial application. All data and code are provided to facilitate reproducibility.
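The explicit, eigenvector-based solution the abstract mentions is, at its core, the classical (Torgerson) multidimensional scaling decomposition. Below is a minimal sketch of that classical step for a plain symmetric dissimilarity matrix; the paper's three-way, asymmetric, and h-plot extensions are not reproduced here.

```python
import numpy as np

def classical_mds(D, k=2):
    """Embed a symmetric dissimilarity matrix D into k dimensions
    via the eigendecomposition of the double-centered matrix."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered Gram matrix
    w, V = np.linalg.eigh(B)              # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]         # top-k eigenpairs
    w_top = np.clip(w[idx], 0, None)      # guard against tiny negatives
    return V[:, idx] * np.sqrt(w_top)

# Three collinear points at positions 0, 3, 7 -> a 1-D embedding is exact
D = np.array([[0., 3., 7.],
              [3., 0., 4.],
              [7., 4., 0.]])
X = classical_mds(D, k=1)
```

Because the solution is a single eigendecomposition, it is free of local minima and its goodness of fit can be read off the discarded eigenvalues, which is the advantage the abstract highlights.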


Out-of-Sample Embedding with Proximity Data: Projection versus Restricted Reconstruction

Trosset, Michael W., Tan, Kaiyi, Tang, Minh, Priebe, Carey E.

arXiv.org Machine Learning

The problem of using proximity (similarity or dissimilarity) data for the purpose of "adding a point to a vector diagram" was first studied by J.C. Gower in 1968. Since then, a number of methods -- mostly kernel methods -- have been proposed for solving what has come to be called the problem of *out-of-sample embedding*. We survey the various kernel methods that we have encountered and show that each can be derived from one or the other of two competing strategies: *projection* or *restricted reconstruction*. Projection can be analogized to a well-known formula for adding a point to a principal component analysis. Restricted reconstruction poses a different challenge: how to best approximate redoing the entire multivariate analysis while holding fixed the vector diagram that was previously obtained. This strategy results in a nonlinear optimization problem that can be simplified to a unidimensional search. Various circumstances may warrant either projection or restricted reconstruction.
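The projection strategy's analogy to adding a point to a principal component analysis can be sketched for the classical-MDS case, in the spirit of Gower's 1968 point-addition formula. This is an illustrative reconstruction under stated assumptions (a column-centered training embedding and exact squared distances), not the paper's full treatment.

```python
import numpy as np

def out_of_sample_project(X, d2_new):
    """Project a new point into an existing embedding X (rows = embedded
    training points, assumed column-centered), given the new point's
    squared distances d2_new to the training points."""
    sq_norms = np.sum(X ** 2, axis=1)   # ||x_i||^2 for each training point
    # recover the inner products x_i . z from the squared distances,
    # using column-centering to eliminate the unknown ||z||^2 term
    k = -0.5 * ((d2_new - d2_new.mean()) - (sq_norms - sq_norms.mean()))
    return np.linalg.pinv(X) @ k        # least-squares coordinates for z

# Toy check: embed 4 points in the plane, then "add back" one of them
X_train = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
X_train = X_train - X_train.mean(axis=0)   # center the embedding
z_true = X_train[3]
d2 = np.sum((X_train - z_true) ** 2, axis=1)
z_hat = out_of_sample_project(X_train, d2)
```

When the new point lies in the span of the training configuration, as here, the projection recovers it exactly; restricted reconstruction differs precisely when that least-squares fit is a poor summary of the new proximities.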


Complex-valued embeddings of generic proximity data

Münch, Maximilian, Straat, Michiel, Biehl, Michael, Schleif, Frank-Michael

arXiv.org Machine Learning

Proximities are at the heart of almost all machine learning methods. If the input data are given as numerical vectors of equal length, the Euclidean distance or a Hilbertian inner product is frequently used in modeling algorithms. In a more generic view, objects are compared by a (symmetric) similarity or dissimilarity measure, which may not obey particular mathematical properties. This renders many machine learning methods invalid, leading to convergence problems and the loss of guarantees such as generalization bounds. In many cases, the preferred dissimilarity measure is non-metric, like the earth mover's distance, or the similarity measure may not be a simple inner product in a Hilbert space but rather in its generalization, a Krein space. If the input data are non-vectorial, like text sequences, proximity-based learning or n-gram embedding techniques can be applied. Standard embeddings lead to the desired fixed-length vector encoding, but are costly and have substantial limitations in preserving the original data's full information. As an information-preserving alternative, we propose a complex-valued vector embedding of proximity data. This allows suitable machine learning algorithms to use these fixed-length, complex-valued vectors for further processing. The complex-valued data can serve as input to complex-valued machine learning algorithms. In particular, we address supervised learning and use extensions of prototype-based learning. The proposed approach is evaluated on a variety of standard benchmarks and shows strong performance compared to traditional techniques in processing non-metric or non-PSD proximity data.
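The core Krein-space idea, embedding an indefinite (non-PSD) similarity matrix so that a bilinear, non-conjugate product of complex vectors reproduces it, can be sketched with a plain eigendecomposition; this is a generic illustration, not the authors' exact pipeline.

```python
import numpy as np

def complex_embedding(S):
    """Complex-valued vector embedding of a symmetric but possibly
    indefinite similarity matrix S: negative eigenvalues receive purely
    imaginary axis scalings, so the bilinear product X @ X.T (no complex
    conjugation) reproduces S exactly."""
    w, V = np.linalg.eigh(S)
    scale = np.sqrt(w.astype(complex))   # sqrt(-a) becomes i*sqrt(a)
    return V * scale                     # rows are the embedded objects

# Indefinite toy similarity matrix (eigenvalues 3 and -1)
S = np.array([[1.0, 2.0],
              [2.0, 1.0]])
X = complex_embedding(S)
# X @ X.T (bilinear, not Hermitian) recovers S, negative part included
```

A positive semi-definite S would yield a purely real embedding; the imaginary components carry exactly the "negative" part of the Krein-space inner product that a real Euclidean embedding cannot represent.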


Classification on Pairwise Proximity Data

Graepel, Thore, Herbrich, Ralf, Bollmann-Sdorra, Peter, Obermayer, Klaus

Neural Information Processing Systems

We investigate the problem of learning a classification task on data represented in terms of their pairwise proximities. This representation does not refer to an explicit feature representation of the data items and is thus more general than the standard approach of using Euclidean feature vectors, from which pairwise proximities can always be calculated. Our first approach is based on a combined linear embedding and classification procedure, resulting in an extension of the Optimal Hyperplane algorithm to pseudo-Euclidean data. As an alternative, we present another approach based on a linear threshold model in the proximity values themselves, which is optimized using Structural Risk Minimization. We show that prior knowledge about the problem can be incorporated by the choice of distance measures, and examine different metrics w.r.t.
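The second approach, a linear threshold model on the proximity values themselves, can be illustrated with a plain perceptron over proximity-matrix rows. This is a deliberate simplification: the paper optimizes the threshold model via Structural Risk Minimization, which is omitted here.

```python
import numpy as np

def proximity_perceptron(P, y, epochs=100, lr=0.1):
    """Linear threshold model directly on proximities: object i is
    represented by row P[i] of the pairwise proximity matrix, and a
    perceptron learns a weight per training object (sketch only; no
    Structural Risk Minimization)."""
    n = P.shape[0]
    w = np.zeros(n)
    b = 0.0
    for _ in range(epochs):
        for i in range(n):
            if y[i] * (P[i] @ w + b) <= 0:   # misclassified or on boundary
                w += lr * y[i] * P[i]
                b += lr * y[i]
    return w, b

# Toy data: two groups on a line, proximities = negative distances
pts = np.array([0.0, 1.0, 5.0, 6.0])
y = np.array([-1, -1, 1, 1])
P = -np.abs(pts[:, None] - pts[None, :])
w, b = proximity_perceptron(P, y)
preds = np.sign(P @ w + b)
```

No feature vectors are ever constructed: the classifier is a weighting of proximities to the training objects, which is what makes the representation more general than the Euclidean one.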


Active Data Clustering

Hofmann, Thomas, Buhmann, Joachim M.

Neural Information Processing Systems

Active data clustering is a novel technique for clustering of proximity data which utilizes principles from sequential experiment design in order to interleave data generation and data analysis. The proposed active data sampling strategy is based on the expected value of information, a concept rooted in statistical decision theory. This is considered to be an important step towards the analysis of large-scale data sets, because it offers a way to overcome the inherent data sparseness of proximity data.
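The interleaving of data generation and data analysis can be sketched as a query loop. Note that this toy swaps the paper's expected-value-of-information criterion for a random query order and a trivial threshold "clustering", purely to show the loop structure over a sparsely observed proximity matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

def active_clustering_sketch(true_D, n_queries):
    """Alternate between querying pairwise dissimilarities (data
    generation) and clustering from the partially observed matrix
    (data analysis). Query selection here is random, NOT the paper's
    expected value of information."""
    n = true_D.shape[0]
    observed = np.full((n, n), np.nan)
    np.fill_diagonal(observed, 0.0)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    rng.shuffle(pairs)
    for i, j in pairs[:n_queries]:                 # data generation step
        observed[i, j] = observed[j, i] = true_D[i, j]
    # data analysis step: impute unqueried entries with the observed
    # mean, then split objects by their dissimilarity to object 0
    filled = np.where(np.isnan(observed), np.nanmean(observed), observed)
    return (filled[0] > np.median(filled[0])).astype(int)

# Two tight groups on a line; querying all 6 pairs recovers the split
pts = np.array([0.0, 0.1, 5.0, 5.1])
D = np.abs(pts[:, None] - pts[None, :])
labels = active_clustering_sketch(D, n_queries=6)
```

The point of the paper's criterion is precisely to do better than random here: to spend each of the O(n^2) possible queries where it most reduces uncertainty about the cluster structure.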

